Gaussian Distribution
Univariate Gaussian Distribution
Overview
The Gaussian (normal) distribution is one of the most important distributions in probability and statistics. It is a continuous probability distribution that is symmetric about its mean, with the density reaching its maximum at the mean.
Probability density function
The probability density function of a univariate Gaussian distribution \(X \sim \mathcal{N}(\mu, \sigma^2)\), where \(X \in \mathbb{R}\), is given by: \[f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)\]
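As a quick sketch, the density above can be evaluated directly and compared against SciPy's reference implementation (the helper name `gaussian_pdf` is illustrative):

```python
import math

from scipy.stats import norm

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x, computed from the formula."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# Agrees with scipy.stats.norm at a few sample points.
for x in (-1.0, 0.0, 2.5):
    assert abs(gaussian_pdf(x, 1.0, 2.0) - norm.pdf(x, loc=1.0, scale=2.0)) < 1e-12
```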
Where \(\mu\) can be any real number and \(\sigma\) must be a positive real number.
Mean and Variance
The mean and variance are given by: \[\mathbb{E}[X] = \mu \quad \text{and} \quad \operatorname{Var}[X] = \sigma^2\]
KL Divergence
The KL divergence between two univariate normals \(p = \mathcal{N}(\mu_0, \sigma^2_0)\) and \(q = \mathcal{N}(\mu_1, \sigma^2_1)\) is given by: \[D_{\mathrm{KL}}(p \,\|\, q) = \frac{1}{2}\left[\frac{\sigma_0^2}{\sigma_1^2} + \frac{(\mu_1 - \mu_0)^2}{\sigma_1^2} - 1 + \ln\frac{\sigma_1^2}{\sigma_0^2}\right]\]
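A minimal sketch of the closed form (the function name `kl_normal` is illustrative). Note the asymmetry: \(D_{\mathrm{KL}}(p \,\|\, q) \neq D_{\mathrm{KL}}(q \,\|\, p)\) in general, and the divergence is zero exactly when the two distributions coincide.

```python
import math

def kl_normal(mu0, sigma0, mu1, sigma1):
    """D_KL( N(mu0, sigma0^2) || N(mu1, sigma1^2) ), closed form."""
    return 0.5 * (
        sigma0 ** 2 / sigma1 ** 2
        + (mu1 - mu0) ** 2 / sigma1 ** 2
        - 1.0
        + math.log(sigma1 ** 2 / sigma0 ** 2)
    )

assert kl_normal(0.0, 1.0, 0.0, 1.0) == 0.0          # identical distributions
assert kl_normal(0.0, 1.0, 1.0, 2.0) != kl_normal(1.0, 2.0, 0.0, 1.0)  # asymmetric
```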
Mean and Variance Estimation
If we have a sample \(x_1, x_2, \ldots, x_n\) from a univariate Gaussian distribution \(X \sim \mathcal{N}(\mu, \sigma^2)\), then the mean and variance can be estimated by: \[\hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i \quad \text{and} \quad \hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \hat{\mu})^2\] where the \(n - 1\) denominator makes \(\hat{\sigma}^2\) an unbiased estimator of the variance.
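These estimators map directly onto NumPy, where `ddof=1` selects the \(n - 1\) denominator (the sample size and seed below are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.5
sample = rng.normal(mu, sigma, size=100_000)

mu_hat = sample.mean()          # (1/n) * sum of x_i
var_hat = sample.var(ddof=1)    # divides by n - 1 (unbiased estimator)
```

With this many draws both estimates land close to the true parameters \(\mu = 2\) and \(\sigma^2 = 2.25\).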
Multivariate Gaussian Distribution
Overview
The multivariate Gaussian distribution is a generalization of the univariate Gaussian distribution to multiple variables. It is a continuous probability distribution that is symmetric about its mean vector, with the density reaching its maximum at the mean.
Probability density function
The probability density function of a multivariate Gaussian distribution \(\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\), where \(\mathbf{X} \in \mathbb{R}^k\), is given by: \[f(\mathbf{x}) = \frac{1}{(2\pi)^{k/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right)\]
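As a sketch, this density can be computed term by term and checked against `scipy.stats.multivariate_normal` (the helper name `mvn_pdf` and the test point are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_pdf(x, mu, Sigma):
    """Density of N(mu, Sigma) at x, computed directly from the formula."""
    k = mu.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)  # (x - mu)^T Sigma^{-1} (x - mu)
    norm_const = (2 * np.pi) ** (k / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([0.3, 0.7])
assert np.isclose(mvn_pdf(x, mu, Sigma), multivariate_normal(mu, Sigma).pdf(x))
```

Using `np.linalg.solve` instead of forming \(\boldsymbol{\Sigma}^{-1}\) explicitly is the numerically safer choice for the quadratic form.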
Mean vector and covariance matrix
The mean vector and covariance matrix are given by: \[\mathbb{E}[\mathbf{X}] = \boldsymbol{\mu} = \begin{pmatrix}\mu_1 \\ \vdots \\ \mu_k\end{pmatrix} \qquad \boldsymbol{\Sigma} = \begin{pmatrix}\sigma_1^2 & \cdots & \sigma_{1k} \\ \vdots & \ddots & \vdots \\ \sigma_{k1} & \cdots & \sigma_k^2\end{pmatrix}\]
where \(\sigma_{ij} = \operatorname{Cov}(X_i, X_j) = \rho_{ij}\sigma_i\sigma_j\) and \(\boldsymbol{\Sigma}\) must be symmetric positive semi-definite (positive definite for the density above to exist).
Mahalanobis distance
The squared Mahalanobis distance is given by: \[\Delta^2 = (\mathbf{x} - \boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\] When \(\mathbf{x}\) is drawn from \(\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\), \(\Delta^2 \sim \chi^2(k)\).
Contours of constant density are ellipsoids satisfying \(\Delta^2 = c\).
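The \(\chi^2(k)\) claim can be checked empirically: draw from a multivariate normal, compute \(\Delta^2\) for each sample, and confirm that about half the values fall below the \(\chi^2(k)\) median (the covariance matrix and seed below are arbitrary illustrations):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
mu = np.zeros(3)
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.5]])
X = rng.multivariate_normal(mu, Sigma, size=50_000)

# Squared Mahalanobis distance Delta^2 for every sample at once.
diff = X - mu
d2 = np.einsum("ij,ij->i", diff @ np.linalg.inv(Sigma), diff)

# Under Delta^2 ~ chi^2(3), the empirical median fraction should be ~0.5.
median = chi2.ppf(0.5, df=3)
frac = (d2 <= median).mean()
```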
KL divergence
The KL divergence between two multivariate normals \(p = \mathcal{N}(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0)\) and \(q = \mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)\) is given by: \[D_{\mathrm{KL}}(p \,\|\, q) = \frac{1}{2}\left[\operatorname{tr}(\boldsymbol{\Sigma}_1^{-1}\boldsymbol{\Sigma}_0) + (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)^\top\boldsymbol{\Sigma}_1^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0) - k + \ln\frac{|\boldsymbol{\Sigma}_1|}{|\boldsymbol{\Sigma}_0|}\right]\]
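A sketch of this closed form (the helper name `kl_mvn` is illustrative; `slogdet` is used rather than `det` to keep the log-determinants numerically stable):

```python
import numpy as np

def kl_mvn(mu0, Sigma0, mu1, Sigma1):
    """D_KL( N(mu0, Sigma0) || N(mu1, Sigma1) ), closed form."""
    k = mu0.shape[0]
    Sigma1_inv = np.linalg.inv(Sigma1)
    diff = mu1 - mu0
    term_tr = np.trace(Sigma1_inv @ Sigma0)        # tr(Sigma1^{-1} Sigma0)
    term_quad = diff @ Sigma1_inv @ diff           # quadratic term
    _, logdet0 = np.linalg.slogdet(Sigma0)
    _, logdet1 = np.linalg.slogdet(Sigma1)
    return 0.5 * (term_tr + term_quad - k + logdet1 - logdet0)

# Identical distributions have zero divergence.
I = np.eye(2)
assert np.isclose(kl_mvn(np.zeros(2), I, np.zeros(2), I), 0.0)
```

With \(k = 1\) and \(1 \times 1\) "matrices" this reduces to the univariate formula above.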
Marginal distributions
Any subset \(\mathbf{X}_A\) of the components of \(\mathbf{X}\) is itself multivariate normal: \[\mathbf{X}_A \sim \mathcal{N}(\boldsymbol{\mu}_A,\, \boldsymbol{\Sigma}_{AA})\] where \(\boldsymbol{\mu}_A\) and \(\boldsymbol{\Sigma}_{AA}\) keep the entries, rows, and columns of \(\boldsymbol{\mu}\) and \(\boldsymbol{\Sigma}\) indexed by \(A\).
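In code, marginalization is just indexing: keep the selected entries of \(\boldsymbol{\mu}\) and the corresponding rows and columns of \(\boldsymbol{\Sigma}\) (the parameter values below are illustrative):

```python
import numpy as np

mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[2.0, 0.5, 0.1],
                  [0.5, 1.0, 0.3],
                  [0.1, 0.3, 1.5]])

# Marginal over the subset A = {0, 2}: X_A ~ N(mu_A, Sigma_AA).
A = [0, 2]
mu_A = mu[A]
Sigma_AA = Sigma[np.ix_(A, A)]   # rows and columns indexed by A
```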
Conditional distribution
Partitioning \(\mathbf{X} = (\mathbf{X}_1, \mathbf{X}_2)\), the distribution of \(\mathbf{X}_1\) given \(\mathbf{X}_2 = \mathbf{x}_2\) is given by: \[\mathbf{X}_1 \mid \mathbf{X}_2 = \mathbf{x}_2 \;\sim\; \mathcal{N}\!\left(\boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2),\;\; \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}\right)\]
The term \(\boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}\) is the Schur complement of \(\boldsymbol{\Sigma}_{22}\) in \(\boldsymbol{\Sigma}\), and represents the reduction in uncertainty about \(\mathbf{X}_1\) after observing \(\mathbf{X}_2\).
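The conditioning formula can be sketched directly from the block partition (the helper name `condition` and the bivariate test case are illustrative):

```python
import numpy as np

def condition(mu, Sigma, idx1, idx2, x2):
    """Parameters of X1 | X2 = x2 for X ~ N(mu, Sigma)."""
    mu1, mu2 = mu[idx1], mu[idx2]
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S21 = Sigma[np.ix_(idx2, idx1)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    gain = S12 @ np.linalg.inv(S22)       # Sigma_12 Sigma_22^{-1}
    mu_cond = mu1 + gain @ (x2 - mu2)     # shifted mean
    Sigma_cond = S11 - gain @ S21         # Schur complement
    return mu_cond, Sigma_cond

# Bivariate case with unit variances and correlation rho = 0.8:
# conditional mean is rho * x2, conditional variance is 1 - rho^2.
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
m, S = condition(mu, Sigma, [0], [1], np.array([1.0]))
```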
Linear transformation
If \(\mathbf{Y} = A\mathbf{X} + \mathbf{b}\) for matrix \(A \in \mathbb{R}^{m \times k}\): \[\mathbf{Y} \sim \mathcal{N}(A\boldsymbol{\mu} + \mathbf{b},\;\; A\boldsymbol{\Sigma} A^\top)\]
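The closed-form parameters of \(\mathbf{Y}\) can be compared against a Monte Carlo estimate; here \(A\) sums the two components, an arbitrary choice for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.4],
                  [0.4, 2.0]])
A = np.array([[1.0, 1.0]])   # m x k with m = 1: Y = X1 + X2 + b
b = np.array([3.0])

# Closed-form parameters: Y ~ N(A mu + b, A Sigma A^T).
mu_Y = A @ mu + b
Sigma_Y = A @ Sigma @ A.T

# Monte Carlo check of the same quantities.
X = rng.multivariate_normal(mu, Sigma, size=200_000)
Y = X @ A.T + b
```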
Empirical rule
For a univariate normal random variable \(X \sim \mathcal{N}(\mu, \sigma^2)\), the empirical rule (also called the 68–95–99.7 rule or three-sigma rule) states that nearly all probability mass lies within a few standard deviations of the mean (see the illustration at the top of the page):
- About 68.2% of values fall in the interval \(\mu \pm 1\sigma\) (between one standard deviation below and above the mean).
- About 95.4% fall in \(\mu \pm 2\sigma\).
- About 99.7% fall in \(\mu \pm 3\sigma\).
These percentages are rounded values of the exact normal probabilities (68.27%, 95.45%, and 99.73%); the rule is often summarized in teaching as roughly 68%, 95%, and 99.7%. The illustration above marks those intervals along the horizontal axis under the bell curve. The rule is useful for quick mental checks (e.g., outliers beyond \(\mu \pm 3\sigma\) are very rare under normality) but holds exactly only when the data-generating process is normal; for heavy-tailed or skewed distributions, tail probabilities can differ markedly.
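The three probabilities follow directly from the normal CDF; by the location-scale property it suffices to check the standard normal:

```python
from scipy.stats import norm

# P(mu - c*sigma < X < mu + c*sigma) for any normal equals
# P(-c < Z < c) for a standard normal Z.
for c, expected in [(1, 0.6827), (2, 0.9545), (3, 0.9973)]:
    mass = norm.cdf(c) - norm.cdf(-c)
    assert abs(mass - expected) < 5e-4
```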